-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consolidate 1D pandas object handling in as_column #14394
Consolidate 1D pandas object handling in as_column #14394
Conversation
if cudf.get_option("mode.pandas_compatible"): | ||
raise NotImplementedError("not supported") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: What happens to this type of data if we're not in pandas-compat mode? And why is it not supported if we are?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens to this type of data if we're not in pandas-compat mode?
We currently convert this correctly to a corresponding cudf type
In [1]: import cudf; import pandas as pd
In [2]: cudf.from_pandas(pd.Series([1], dtype=pd.Int64Dtype()))
Out[2]:
0 1
dtype: int64
And why is it not supported if we are?
We disallowed it for cudf.pandas
since currently we cannot roundtrip back to pandas correctly since cudf doesn't keep track whether a e.g. numpy.int64 or pandas.Int64Dtype was passed in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, makes sense, thanks.
# float16 | ||
arbitrary = arbitrary.astype(np.dtype(np.float32)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: Because cudf doesn't support float16 natively? Does this potentially cause problems in cudf-pandas mode since the dtype will not be preserved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I believe so I was migrating this code below here
arb_dtype = (
cudf.dtype("float32")
if arbitrary.dtype == "float16"
else cudf.dtype(arbitrary.dtype)
)
pandas also does not support float16
(which was made more intentional by raising in pandas 2.0, before it would sometimes coerce to float32
also), so I don't think the dtype preservation here isn't too much of an issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion:
Given that the pandas-2 behaviour raises, let us take this opportunity to add a deprecation warning here so that we can also raise once we're supporting pandas-2. (See https://docs.rapids.ai/api/cudf/nightly/developer_guide/contributing_guide/#deprecating-and-removing-code)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah so my generalization about float16 in pandas 2.0 isn't entirely correct.
Only pandas Index objects will disallow float16 (IIRC there's no hashtable implementation for this type), but Series and DataFrame objects will continue to allow float16.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, but you can't do many things with float16 series (e.g. merge doesn't work).
I think I would rather raise
here (as we do at the moment for non-pandas data):
import cudf
cudf.Series([1, 2, 3], dtype="float16")
# TypeError: Unsupported type float16
Rather than silently upcasting. WDYT? I realise this would be a breaking change (since we currently do upcast in from_pandas
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah agreed raising here is better than silent upcasting, I'll make this raise and mark as a breaking change
raise TypeError( | ||
f"Cannot convert a object type of {inferred_dtype}" | ||
) | ||
# TODO: nan_as_na interaction here with from_pandas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we open an issue documenting the various TODOs in data ingest so that we have an overview of what is still in-progress? It's unclear to me how much of this needs new implementation in cudf and how much needs thinking about the boundaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes definitely. Once I get all the existing tests passing I'll make an issue of the edge cases to handle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
…mn/pandas_handling
…mn/pandas_handling
…mn/pandas_handling
Should be ready for another review. After another pass I don't think there's any outstanding todo's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor question and a suggestion to add a deprecation warning, but approving now because this looks in great shape.
# float16 | ||
arbitrary = arbitrary.astype(np.dtype(np.float32)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion:
Given that the pandas-2 behaviour raises, let us take this opportunity to add a deprecation warning here so that we can also raise once we're supporting pandas-2. (See https://docs.rapids.ai/api/cudf/nightly/developer_guide/contributing_guide/#deprecating-and-removing-code)
if nan_as_null is None or nan_as_null is True: | ||
data = build_column(buffer, dtype=arbitrary.dtype) | ||
data = _make_copy_replacing_NaT_with_null(data) | ||
mask = data.mask | ||
else: | ||
bool_mask = as_column(~np.isnat(arbitrary)) | ||
mask = as_buffer(bools_to_mask(bool_mask)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it turns NaT
into nulls, is that the intention?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think we still want to consider NaT
values when creating the mask (as null values), but not necessarily cast the value to NA as tested in test_series_np_array_nat_nan_as_null_false
. cc @galipremsagar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, at one point of time we supported both NA
and NaT
for datetime & timedelta columns. Now we just treat NA
as NaT
and vice-versa for datetime & timedelta column. With recent pandas accelerator mode work, we decided to just repr out NA
as NAT
. So yes we need to mark the mask when we have NAT
's anywhere.
if nan_as_null is None or nan_as_null is True: | ||
data = build_column(buffer, dtype=arbitrary.dtype) | ||
data = _make_copy_replacing_NaT_with_null(data) | ||
mask = data.mask | ||
else: | ||
bool_mask = as_column(~np.isnat(arbitrary)) | ||
mask = as_buffer(bools_to_mask(bool_mask)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same question here.
Co-authored-by: GALI PREM SAGAR <[email protected]>
/merge |
Description
Currently
as_column
has a few different branches handling pandas objects. This PR consolidatesSeries
,Index
andExtensionArray
handling into 1if
branch such handling is consistent between the 3.This also disallows
float16
types passed intodtype
constructor arguments or typeddata
argumentsChecklist